🌱 fix(e2e): wait for leader election #1676
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1676      +/-   ##
==========================================
- Coverage   67.50%   67.48%   -0.03%
==========================================
  Files          57       57
  Lines        4632     4632
==========================================
- Hits         3127     3126       -1
- Misses       1278     1279       +1
  Partials      227      227
@@ -40,6 +40,20 @@ func TestClusterExtensionAfterOLMUpgrade(t *testing.T) {
	t.Log("Wait for operator-controller deployment to be ready")
	managerPod := waitForDeployment(t, ctx, "operator-controller-controller-manager")

	t.Log("Start measuring leader election time")
I think we need to be careful about how we measure the timing here. What we are measuring right now is the amount of time between:
- the test detecting that the operator-controller deployment has finished rolling out, and
- the call to watchPodLogsForSubstring(leaderElectionCtx, managerPod, "manager", leaderSubstrings...) returning.

This may correlate with the time taken for leader election, but it won't necessarily. E.g. let's say I upgrade the deployments, go out for lunch for 1h, come back, and run the post-upgrade test.
Maybe it would be better to extract the timestamps from the first log line and the leader-election log line instead?
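A rough sketch of what that could look like, relying on the kubelet-provided log timestamps (PodLogOptions{Timestamps: true}) rather than the manager's own log format. The helper name measureLeaderElection and the clientset/namespace/pod parameters are illustrative assumptions, not code from this PR:

```go
package e2e

import (
	"bufio"
	"context"
	"fmt"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// measureLeaderElection (hypothetical) returns the time between the first log
// line the manager emitted and the "successfully acquired lease" line, using
// the RFC3339Nano timestamps the kubelet prefixes when Timestamps is true.
func measureLeaderElection(ctx context.Context, cs kubernetes.Interface, namespace, pod, container string) (time.Duration, error) {
	stream, err := cs.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{
		Container:  container,
		Timestamps: true, // prefix each line with the time it was produced
	}).Stream(ctx)
	if err != nil {
		return 0, err
	}
	defer stream.Close()

	var first time.Time
	scanner := bufio.NewScanner(stream)
	for scanner.Scan() {
		ts, rest, ok := strings.Cut(scanner.Text(), " ")
		if !ok {
			continue
		}
		parsed, err := time.Parse(time.RFC3339Nano, ts)
		if err != nil {
			continue
		}
		if first.IsZero() {
			first = parsed // timestamp of the very first manager log line
		}
		if strings.Contains(rest, "successfully acquired lease") {
			return parsed.Sub(first), nil
		}
	}
	return 0, fmt.Errorf("lease acquisition not seen in logs (scan err: %v)", scanner.Err())
}
```

This would measure elapsed time inside the pod itself, so it would be unaffected by how long the test waits before starting to read the logs.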
Your comment makes 100% sense.
To keep things simple and focused on the goal of this PR, I've removed the measurement aspect. Whether we want to include it as info, debug, or decide on a specific measurement approach is a separate discussion. For now, let's stay within the scope of this change: fixing the test flake and unblocking progress.
TestClusterExtensionAfterOLMUpgrade was failing due to increased leader election timeouts, causing reconciliation checks to run before leadership was acquired. This fix ensures the test explicitly waits for leader election logs (`"successfully acquired lease"`) before verifying reconciliation.
lgtm ^^
t.Log("Wait for acquired leader election") | ||
// Average case is under 1 minute but in the worst case: (previous leader crashed) | ||
// we could have LeaseDuration (137s) + RetryPeriod (26s) +/- 163s | ||
leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute) |
I am assuming 3 minutes is the worst case scenario. I am not familiar with context.WithTimeout; does it return if we acquire the lease before 163s?
context.WithTimeout just gives you a context that times out (gets cancelled) after the timeout period.
This means that the call to watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...) will return with an error after 3 minutes if it hasn't already returned.
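A minimal, self-contained illustration of that behaviour (nothing here is from the PR; waitForSignal just stands in for the log-watching call): the context is cancelled once the timeout elapses, but a caller that finishes its work earlier returns immediately.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForSignal stands in for watchPodLogsForSubstring: it blocks until the
// signal arrives or the context is cancelled (i.e. the timeout was reached).
func waitForSignal(ctx context.Context, signal <-chan struct{}) error {
	select {
	case <-signal:
		return nil // found what we were waiting for, before the deadline
	case <-ctx.Done():
		return ctx.Err() // "context deadline exceeded" after 3 minutes
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
	defer cancel() // always release the timer, even on the happy path

	signal := make(chan struct{})
	go func() {
		time.Sleep(2 * time.Second) // e.g. the lease is acquired quickly
		close(signal)
	}()

	// Prints <nil> after ~2 seconds, not after 3 minutes.
	fmt.Println(waitForSignal(ctx, signal))
}
```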
ok, looks like it is a straightforward timeout method.
	defer leaderCancel()

	leaderSubstrings := []string{"successfully acquired lease"}
	leaderElected, err := watchPodLogsForSubstring(leaderCtx, managerPod, "manager", leaderSubstrings...)
Scraping the logs seems brittle.
Would it be better to use a Watch on the leader election? We could use the Leases from CoordinationV1Client in "k8s.io/client-go/kubernetes/typed/coordination/v1"?
I realize it's also longer and more code, but the upside is that it reacts right away, like watching the pod log, without caring whether the log strings change at some point and break our tests outside of our control.
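For reference, a hedged sketch of what that Lease-based wait could look like using the typed CoordinationV1 client. The helper name waitForLeaderLease and the lease name/namespace parameters are assumptions for illustration, not code proposed in this PR:

```go
package e2e

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// waitForLeaderLease (hypothetical) blocks until some replica holds the named
// lease, or until ctx is cancelled (e.g. by a 3-minute timeout).
func waitForLeaderLease(ctx context.Context, cs kubernetes.Interface, namespace, leaseName string) error {
	watcher, err := cs.CoordinationV1().Leases(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", leaseName).String(),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case event, ok := <-watcher.ResultChan():
			if !ok {
				return fmt.Errorf("watch on lease %q closed unexpectedly", leaseName)
			}
			lease, isLease := event.Object.(*coordinationv1.Lease)
			if !isLease {
				continue
			}
			// A non-empty holderIdentity means leader election has completed.
			if lease.Spec.HolderIdentity != nil && *lease.Spec.HolderIdentity != "" {
				return nil
			}
		}
	}
}
```

One caveat with a bare Watch is that it only sees changes that happen after it starts, so a real version would probably List the lease first (or use something like watchtools.UntilWithSync) to handle a lease that is already held.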
that's a good idea! If this work is blocking CI, I'd say merge it as it is, then follow up with the watch ^^
Yes, I agree that we could do something fancier.
But we already check the logs in many places, indeed right below as well.
We can see if we improve it afterwards, but there is no reason for us to keep facing the pain of the flake.
TestClusterExtensionAfterOLMUpgrade was failing due to increased leader election timeouts, causing reconciliation checks to run before leadership was acquired.
This fix ensures the test explicitly waits for leader election logs ("successfully acquired lease") before verifying reconciliation.
Example: https://github.com/operator-framework/operator-controller/actions/runs/13047935813/job/36401741998
Logs from operator-controller: